Finding Things in Files 1 


“I have a very large file but I only need to extract certain parts of it. How do I do that?” 

Extracting and reporting is a standard task in any bioinformatic pipeline. This session is about a 
few ways of using standard tools for this task. 

Structured data? 


Apa    0.123    TRUE     food
Apa    0.223    FALSE    food
Apa    0.323    TRUE     ffood
Bpa    0.123    FALSE    food


Then we can cut, slice, and sort! 

Working with column data (CSV) 

• Structured data in a proprietary format (MS Excel)? 

• Export to CSV (Comma Separated Values)! 

Working with column data (CSV) 

• We’ll work on a stream, not the entire file! 

• File size is not a show stopper! 

• We can work on compressed files, “without uncompressing” them! 
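A minimal sketch of what working on a stream can look like, assuming a small gzip-compressed file created here purely for illustration (the name demo.csv.gz is hypothetical). The decompressor writes to standard output and the next tool reads from the pipe, so no uncompressed copy is written to disk.

```shell
# Build a tiny compressed CSV for illustration (file name is hypothetical).
tmpdir=$(mktemp -d)
printf 'Apa,0.123,TRUE,food\nBpa,0.123,FALSE,food\n' | gzip -c > "$tmpdir/demo.csv.gz"

# Decompress to standard output and pick column 1 directly from the stream.
first_col=$(gunzip -c "$tmpdir/demo.csv.gz" | cut -d, -f1)
echo "$first_col"

rm -r "$tmpdir"
```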


Working with column data (CSV) 

Apa,0.123,TRUE,food 
Apa,0.223,FALSE,food 
Apa,0.323,TRUE,ffood 
Bpa,0.123,FALSE,food 


Working with column data (CSV) 

"Apa",0.123,TRUE,"food, spam, and spam" 
"Apa",0.223,FALSE,"food, spam, and spam" 
"Apa",0.323,TRUE,"ffood, spam, and spam" 
"Bpa",0.123,FALSE,"food, spam, and spam" 


Working with column data (CSV), tab-separated 
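Quoted fields that contain commas are one reason to prefer tabs as the separator. The sketch below, using one record in the quoted style, shows how a plain comma split miscounts the columns; the variable names are only for illustration.

```shell
line='"Apa",0.123,TRUE,"food, spam, and spam"'

# A plain comma split sees 6 fields, although the record has 4 columns.
naive_count=$(printf '%s\n' "$line" | awk -F',' '{print NF}')
echo "$naive_count"

# cut has the same problem: "field 4" is only the first part of the quoted text.
naive_f4=$(printf '%s\n' "$line" | cut -d, -f4)
echo "$naive_f4"
```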


Apa	0.123	TRUE	food, spam, and spam
Apa	0.223	FALSE	food, spam, and spam
Apa	0.323	TRUE	ffood, spam, and spam
Bpa	0.123	FALSE	food, spam, and spam

Working with column data (CSV): cut 

$ cut -f1,2,5 data.tsv 
$ cut -d, -f1,2,5 data.csv 
$ cut -f1,8- data.tsv 
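A small self-contained sketch of the cut options used here, run on made-up three-column data rather than the data.tsv of the slides:

```shell
# Three-column, tab-separated sample (contents are made up for illustration).
tsv=$(printf 'Apa\t0.123\tTRUE\nBpa\t0.223\tFALSE\n')

# Tab is the default delimiter for cut: keep columns 1 and 3.
cols_1_3=$(printf '%s\n' "$tsv" | cut -f1,3)

# An open-ended range: column 2 to the end of the line.
cols_2_on=$(printf '%s\n' "$tsv" | cut -f2-)

printf '%s\n' "$cols_1_3" "$cols_2_on"
```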

Working with column data (CSV): awk 

awk '{actions}' 

$ awk '{print $1,$5}' data.tsv 
$ awk -F"," '{print $1,$5}' data.csv 
$ awk -F"," '{print $5,$5,$1}' data.csv 
$ awk '{print "Foo",$3+$6}' data.tsv 

Working with column data (CSV): awk 

$ awk '{print $1,$NF}' data.tsv 

$ awk -F"," '{print NF}' data.csv 
Watch the field delimiter! 

$ awk '{print NF}' data.tsv 

$ awk -F"\t" '{print NF}' data.tsv 
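Why the delimiter matters can be seen on a single record. This is a sketch on one made-up tab-separated line whose last column contains spaces:

```shell
# One tab-separated record whose last column contains spaces.
rec=$(printf 'Apa\t0.123\tTRUE\tfood, spam, and spam')

# Default splitting is on any whitespace: the last column falls apart.
nf_default=$(printf '%s\n' "$rec" | awk '{print NF}')

# Splitting on tabs only gives the expected column count.
nf_tab=$(printf '%s\n' "$rec" | awk -F'\t' '{print NF}')

echo "$nf_default $nf_tab"
```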

Working with column data (CSV): awk 

awk 'BEGIN{actions} /search/{actions} END{actions}' 
$ awk 'BEGIN{OFS=","}{print $1,$2}' data.tsv 

$ awk '$2 == "Bpa" {print $1,$2,$5}' data.tsv 

$ awk '{sum+=$3}END{print sum}' data.tsv 

$ awk '{sum+=$3}END{print sum/NR}' data.tsv 
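The same building blocks on a tiny made-up table, as a sketch: an END block for totals, a condition in front of an action, and OFS for the output separator.

```shell
data=$(printf 'Apa\t0.1\t2\nApa\t0.2\t4\nBpa\t0.3\t6\n')

# Sum and mean of column 3; NR holds the number of records read.
sum=$(printf '%s\n' "$data" | awk -F'\t' '{sum += $3} END {print sum}')
mean=$(printf '%s\n' "$data" | awk -F'\t' '{sum += $3} END {print sum / NR}')

# A pattern in front of the action restricts it to matching records,
# and OFS sets the separator that print puts between its arguments.
apa=$(printf '%s\n' "$data" | awk -F'\t' 'BEGIN {OFS=","} $1 == "Apa" {print $1, $3}')

printf '%s\n' "$sum" "$mean" "$apa"
```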

Working with column data (CSV): awk 

awk 'BEGIN{actions} /search/{actions} END{actions}' 
$ awk '$1 > 10000 {print $1,$2}' lw.nex.run1.p 

$ awk '$1 > 10000 {print $1,$2}' lw.nex.run1.p | \ 
gnuplot -p -e "plot '-' using 1:2 with lines" 

Working with column data (CSV): sort 


$ sort data.tsv 

$ sort -r data.tsv 

Sort data based on column 2 

$ sort -k2 data.tsv 

$ sort -t, -k2 data.csv 

Working with column data (CSV): sort 

Watch the field delimiter! 

$ sort -k10 data.tsv 

$ sort -t$'\t' -k10 data.tsv 


Working with column data (CSV): sort 

Sort data based on column 2, then on column 4 
$ sort -t$'\t' -k2,2 -k4 data.tsv 

$ man sort 
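A sketch of a two-key sort on made-up data. The n modifier on the second key is an addition for illustration, not part of the commands above; it makes the tie-break numeric.

```shell
data=$(printf 'x\tb\t1\t10\nx\ta\t1\t9\nx\ta\t1\t100\n')

# -k2,2 limits the first key to column 2 alone; -k4,4n then breaks ties
# numerically on column 4. Without the n, "100" would sort before "9".
sorted=$(printf '%s\n' "$data" | sort -t "$(printf '\t')" -k2,2 -k4,4n)
echo "$sorted"
```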

Working with column data (CSV): sort 

Print the unique values in column 7 
$ cut -d, -f7 data.csv | sort --unique 
“Version” sorting 
$ sort -t$'\t' -k8 -V data.tsv 
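What version sorting changes, sketched on three made-up labels. This assumes a sort that supports -V (a GNU extension, also available in some other implementations).

```shell
# Plain sort compares character by character, so "v1.10" lands before "v1.2".
plain=$(printf 'v1.9\nv1.10\nv1.2\n' | LC_ALL=C sort)

# Version sort compares the embedded numbers as numbers.
version=$(printf 'v1.9\nv1.10\nv1.2\n' | LC_ALL=C sort -V)

printf '%s\n' "$plain" "--" "$version"
```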

Printing lines, first and last 

$ head -3 data.csv 
$ tail -3 data.csv 

Printing lines, specifics 

Print line 2 
$ sed '2!d' data.csv 

$ sed -n '2p' data.csv 

$ awk 'NR == 2' data.csv 

Print lines 2 to 5 
$ sed '2,5!d' data.csv 

$ sed -n '2,5p' data.csv 

$ awk 'NR >= 2 && NR <= 5' data.csv 
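A sketch that checks the line-range variants against each other on six numbered lines. The last variant, which quits after line 5, is an addition for illustration; on large inputs it avoids reading the rest of the file.

```shell
# Number the lines so the selection is easy to check.
lines=$(printf 'line1\nline2\nline3\nline4\nline5\nline6\n')

with_sed=$(printf '%s\n' "$lines" | sed -n '2,5p')
with_awk=$(printf '%s\n' "$lines" | awk 'NR >= 2 && NR <= 5')

# Print lines 2 to 5, then stop reading after line 5.
with_quit=$(printf '%s\n' "$lines" | sed -n '2,5p;5q')

echo "$with_sed"
```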

Printing lines, conditional 

$ awk '/Apa/' data.csv 
$ sed -n '/Apa/p' data.csv 
$ sed '/Apa/!d' data.csv 

Printing lines, conditional: grep 

$ grep 'Apa' data.csv 
$ grep -v 'Apa' data.csv 
$ grep -E 'Apa|Cpa' data.csv 
$ egrep 'Apa|Cpa' data.csv 
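The grep variants on three made-up lines, as a sketch. The -c option (count matching lines) is an addition for illustration.

```shell
data=$(printf 'Apa,1\nBpa,2\nCpa,3\n')

match=$(printf '%s\n' "$data" | grep 'Apa')             # lines containing Apa
invert=$(printf '%s\n' "$data" | grep -v 'Apa')         # lines NOT containing Apa
either=$(printf '%s\n' "$data" | grep -E 'Apa|Cpa')     # Apa or Cpa
count=$(printf '%s\n' "$data" | grep -c -E 'Apa|Cpa')   # number of matching lines

printf '%s\n' "$match" "$invert" "$either" "$count"
```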

Finding things in files, using REGULAR EXPRESSIONS 

Regular expressions, regexp’s, are patterns constructed with the aim of matching strings in texts. 
These patterns are made by combining regular characters with “special” characters: 

\ ^ $ . | ? * + ( ) [ ] { } 


One example: 

^[A-Z]{6}$ 

Will match “A string of capital letters, six letters long, that is located at the very beginning of the line, and is 
followed by nothing else on the line”. 

Match 

m// 

// 

/match/ 

Example: 

/apa/ 

Will match any string containing the string apa 
(Remember? $ awk '/Apa/' data.csv) 

Match and substitute: 

s/// 

s/match/replace/ 

Example: 

s/apa/bpa/ 

Will replace the (first) occurrence of apa with bpa. 

To replace all occurrences (on the line), use g (for global): 
s/apa/bpa/g 
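The difference between the two forms, sketched with sed on a made-up line:

```shell
# Without g, only the first match on the line is replaced.
first=$(echo 'apa apa apa' | sed 's/apa/bpa/')

# With g, every match on the line is replaced.
all=$(echo 'apa apa apa' | sed 's/apa/bpa/g')

printf '%s\n' "$first" "$all"
```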

“Engines” or tools that can use regular expressions: 

• Scripting languages (Perl, Python, Ruby, JavaScript, R, ...) 

• Command line tools: 

— sed 

— awk 

— grep 

— perl 

• Text editors (Vim, Emacs, ...): 

“Engines” or tools that can use regular expressions: Examples 

$ awk '/^[A-Z]{6}$/' some.text 
HAMLET 

$ sed -n -E '/^[A-Z]{6}$/p' some.text 
HAMLET 

$ grep -E '^[A-Z]{6}$' some.text 
HAMLET 

$ grep '^[A-Z]\{6\}$' some.text 
HAMLET 

$ perl -ne 'print if /^[A-Z]{6}$/' some.text 
HAMLET 
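To try the pattern without the play at hand, here is a sketch on a few made-up lines standing in for some.text. LC_ALL=C is set as a precaution, since the meaning of a range like A-Z can depend on the locale.

```shell
text=$(printf 'HAMLET\nHamlet speaks\nQUEEN\nOPHELIA\n  HAMLET\n')

# Extended syntax: braces are written plainly.
ere=$(printf '%s\n' "$text" | LC_ALL=C grep -E '^[A-Z]{6}$')

# Basic syntax: the braces need backslashes.
bre=$(printf '%s\n' "$text" | LC_ALL=C grep '^[A-Z]\{6\}$')

# sed with -E uses the same extended syntax as grep -E.
sed_out=$(printf '%s\n' "$text" | LC_ALL=C sed -n -E '/^[A-Z]{6}$/p')

printf '%s\n' "$ere" "$bre" "$sed_out"
```

Only the first line matches: QUEEN and OPHELIA have the wrong length, and the indented HAMLET does not start at the beginning of the line.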

Meta characters - characters with special meanings 
\ ^ $ . | ? * + ( ) [ ] { } 

Non-Printable Characters 


• Tab \t 

• Newline \n 

• Carriage return \r 

• ... 

Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n. 

Character Classes 

[ae] match an a or an e 

[0-9] matches a single digit between 0 and 9 

[+*] search for a star or plus 

Character Classes - Negations 

[^x] matches any character except x 

[^0-9\r\n] matches any character that is not a digit or a line break. 

Shorthand Character Classes 

\w stands for “word character”. It always matches the ASCII characters [A-Za-z0-9_]. 

\s stands for “white space character” 

\d is short for [0-9] 

The above three shorthands also have negated versions 

\D is the same as [^\d] 

\W is short for [^\w] 

\S is the equivalent of [^\s] 
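Support for these shorthands differs between tools: Perl understands all of them, while POSIX regular expressions (as used by grep and sed) do not define \d. A portable sketch with the POSIX bracket classes instead, on a made-up line:

```shell
line='sample_42 ok'

# [[:digit:]] plays the role of \d.
digits=$(echo "$line" | grep -oE '[[:digit:]]+')

# [[:alnum:]_] plays the role of \w; keep only the first match.
word=$(echo "$line" | grep -oE '[[:alnum:]_]+' | head -n 1)

printf '%s\n' "$digits" "$word"
```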

Dot 

The dot . matches any single character. 

The only exceptions are line break characters. 

To match a period ., we need to use the back slash: \. 

Anchors 

^ match at the beginning 
$ match at the end 

Word Boundaries 


\b 

Example: foo\b will match foo bar but not foobar 

Alternation 

cat|dog|mouse|fish 

Returns a match if one of them matches (cat or dog or ...) 

Repetition 

? zero or one time 
+ one or many times 
* zero or many times 

{min,max} e.g. [z]{2,4} match zz, zzz, or zzzz 

{0,1} is the same as ? {0,} is the same as * {1,} is the same as + 

Omitting both the comma and max tells the engine to repeat the token exactly min times. 
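A sketch of the bounded and exact forms with grep -E on made-up input. The anchors make sure the whole line is tested, not just a part of it.

```shell
words=$(printf 'z\nzz\nzzzz\nzzzzz\n')

# Between two and four z's, and nothing else on the line.
bounded=$(printf '%s\n' "$words" | grep -E '^z{2,4}$')

# Exactly two z's.
exact=$(printf '%s\n' "$words" | grep -E '^z{2}$')

printf '%s\n' "$bounded" "--" "$exact"
```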

Grouping Capturing with back references 

() the match is captured in \1 or $1 
()() the matches are captured in \1, \2, ... 

(()) outer is captured in \1, inner in \2, ... (groups are numbered by their opening parenthesis) 

You can reuse the same back reference more than once. ([a-c])x\1x\1 matches axaxa, bxbxb and cxcxc 
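The back reference example can be tried with grep. This sketch uses the basic syntax, where the group is written with backslashed parentheses; the input lines are made up.

```shell
# \1 must repeat exactly what the group matched, so axbxa is rejected.
hits=$(printf 'axaxa\nbxbxb\naxbxa\n' | grep '\([a-c]\)x\1x\1')
echo "$hits"
```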

Example - Optional Items 

colou?r matches both colour and color 
Nov(ember)? matches Nov and November 

Example - capturing with back references 

$ grep 'These words' some.text 

These words, like daggers, enter in mine ears; 

if ( /(\w+\s+\w+), enter in mine (\w+)/ ) { 
    print "$2 $1\n"; 
} 

$ perl -ne 'if (/(\w+\s+\w+), enter in mine (\w+)/) {print "$2 $1\n"}' some.text 

Example - fix the misspelling 

Someone misspelled ‘food’ 

$ cut -f5 data.tsv 

foood 

food 

ffood 

food 

food 

fooood 

food 

fod 

food 

$ sed 's/foood/food/' data.tsv 

Example - fix the misspelling 

$ cut -f5 data.tsv | sort -u 

ffood 

fod 

food 

foood 

fooood 

$ for f in $(cut -f5 data.tsv | sort -u) ; do 
sed -i -e "s/$f/food/" data.tsv 
done 


$ sed -E 's/f(f)?o([o]+)?d/food/' data.tsv 
$ sed -E 's/f{1,}o{1,}d/food/' data.tsv 
$ sed -E 's/f+o+d/food/' data.tsv 
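A sketch that checks the shortest pattern against the misspellings listed above, fed in directly instead of read from data.tsv:

```shell
variants=$(printf 'foood\nfood\nffood\nfooood\nfod\n')

# One or more f, one or more o, then d: every variant collapses to "food".
fixed=$(printf '%s\n' "$variants" | sed -E 's/f+o+d/food/' | sort -u)
echo "$fixed"
```

Note that the pattern is not tied to column 5 or to whole words, so it would also rewrite a matching stretch inside a longer word elsewhere on the line.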

Example - finding that specific column 

What is the column number for header “NNI (Tau) $acc_run2” in my .mcmc file? 
$ grep '^Gen' *.mcmc | sed 's/\t/\n/g' | nl 
Cut out the column and plot it: 

$ cut -f1,27 *.mcmc | gnuplot -p -e "plot '-' u 1:2 w l" 

Example - What did Hamlet say? 

$ grep -A 5 HAMLET some.text 


$ awk '/HAMLET/,/QUEEN/' some.text 



$ sed '/HAMLET/,/QUEEN/!d' some.text 


$ sed -n '/HAMLET/,/QUEEN/p' some.text 

$ sed '0,/HAMLET/d;/QUEEN/Q' some.text 

$ sed -e '0,/HAMLET/d' -e '/QUEEN/Q' some.text 
$ perl -ne 'undef $/;print $1 if /HAMLET(.*?)QUEEN/s' some.text 

Example - Counting barcodes in fastq files 

$ bunzip2 -c illumina_4M_2.fastq.bz2 | \ 
head 

$ bunzip2 -c illumina_4M_2.fastq.bz2 | \ 
egrep -o '#[ACGTN]+' | \ 
head 

$ bunzip2 -c illumina_4M_2.fastq.bz2 | \ 
egrep -o '#[ACGTN]+' | \ 
sed 's/#//' | \ 
head 


Example - Counting barcodes in fastq files, contd. 

$ bunzip2 -c illumina_4M_2.fastq.bz2 | \ 
egrep -o '#[ACGTN]+' | \ 
sed 's/#//' | \ 
sort | \ 
head 

$ bunzip2 -c illumina_4M_2.fastq.bz2 | \ 
egrep -o '#[ACGTN]+' | \ 
sed 's/#//' | \ 
sort | \ 
uniq -c 
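The counting pipeline can be tried on a three-record sample. The header layout used here (read name, then # and the barcode, then /1) is an assumption for illustration, and the final awk step is an addition that only tidies the uniq -c output.

```shell
# Three made-up FASTQ records; the barcode follows '#' in each header line.
fq=$(printf '@r1#CTTGTG/1\nACGT\n+\nIIII\n@r2#ACAGTG/1\nTTGA\n+\nIIII\n@r3#CTTGTG/1\nGGCA\n+\nIIII\n')

# Extract the barcodes, strip the '#', then count each distinct barcode.
counts=$(printf '%s\n' "$fq" \
  | grep -Eo '#[ACGTN]+' \
  | sed 's/#//' \
  | LC_ALL=C sort \
  | uniq -c \
  | awk '{print $2, $1}')
echo "$counts"
```

Quality lines may also contain a # followed by the letters A, C, G, T or N, so on real data it can be safer to look at header lines only, for example by first filtering with awk 'NR % 4 == 1'.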

Example - do demultiplexing 

$ bunzip2 -c illumina_4M_1.fastq.bz2 | \ 
grep -A3 --no-group-separator '#CTTGTG' > CTTGTG.1.fq 

$ bunzip2 -c illumina_4M_2.fastq.bz2 | \ 
grep -A3 --no-group-separator '#CTTGTG' > CTTGTG.2.fq 

Example - Repetition, Greediness and Laziness 

.+ will match as long (greedy) as possible 

.+? will match as short (lazy) as possible 

Whole line: 

<tag>some text<tag>another text</tag>the rest</tag> 

$ perl -nle 'print $1 if /<tag>(.+)<\/tag>/' gl.txt 

some text<tag>another text</tag>the rest 

$ perl -nle 'print $1 if /<tag>(.+?)<\/tag>/' gl.txt 

some text<tag>another text 

$ perl -nle 'print $1 if /<tag>([^<]+?)<\/tag>/' gl.txt 
another text 

More examples? 